DS5001 Final Project

Quinton Mays (rub9ez)

Fall 2021

This project utilized data from the provided newzy dataset, which contains text from news articles dating from 2014 to 2020.

Data Import and Exploration

The necessary libraries are imported for processing the texts from their as-received raw format (F0) into the Standard Text Analytic Data Model (F2) and Machine Learning Corpus (F1) formats.

Next, the data is imported from a .csv file. This file uses the | character as a delimiter, so that argument is supplied to the pandas method.
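The import step might look like the following sketch; the real file path and column names are assumptions, so an in-memory stand-in is used here:

```python
import io
import pandas as pd

# Toy stand-in for the pipe-delimited newzy CSV (real filename and columns are assumptions)
raw = io.StringIO("doc_id|source|doc_content\n1|Guardian|Some article text\n2|PowerLine|Another article")

# The | delimiter is passed via the sep argument
df = pd.read_csv(raw, sep='|')
```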

To begin exploratory data analysis, content length was explored, as the data description indicated that only some of the sources contained full documents. A logarithmically scaled content length (number of characters) is shown below:

From the histogram, it is clear that most documents are very short, possibly containing only a headline. While these could be explored, longer documents were chosen for analysis, so only content over 1000 characters is selected. Next, as the change in news coverage over time is of interest to this analysis, the temporal dimension of the data is explored:

The plot above indicates that the years 2016 and 2017 are of interest for analysis, as they contain the largest number of articles and the widest range of sources. These years are also of particular interest due to the political nature of the three dominant sources, PowerLine, Guardian, and Daily Kos, and the 2016 United States Presidential Election which occurred during this time.

Therefore, the dataset is limited to articles of more than 1000 characters, from the years 2016 and 2017, and from the sources PowerLine, Guardian, and Daily Kos. This provides one source each considered right-leaning, left-leaning, and neutral.

Next, to balance the number of articles between sources and to reduce computation time, articles were sampled from each month and source in the dataset. To maximize the dataset size while maintaining an equal number of articles from each source, the sample size was set to one-third of the minimum number of documents from any source in any year:
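The balanced sampling step can be sketched as below, grouping by source only for brevity (the project also balances by month); the data is a toy stand-in:

```python
import numpy as np
import pandas as pd

# Toy corpus: 300 articles labeled by source (real notebook also groups by month)
rng = np.random.default_rng(42)
df = pd.DataFrame({'source': rng.choice(['PowerLine', 'Guardian', 'Daily Kos'], size=300)})

# Sample size: one-third of the smallest per-source group, as described above
n = int(df.groupby('source').size().min() // 3)

# Draw an equal-sized random sample from each source
balanced = df.groupby('source', group_keys=False).sample(n=n, random_state=42)
```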

The new distribution of articles is shown below:

Creating TOKEN, LIB, and VOCAB Tables

The LIB and TOKEN tables are created using data from the original dataframe:

LIB Table

The LIB table does not require any immediate annotation, so some exploratory analysis is performed:

As many of the articles from these sources are political in nature, I then explored the headlines for topics that were prevalent in the news during the chosen time period:

TOKEN Table

To begin constructing and annotating the TOKEN table, I drop any NAs from the TOKEN table:

Then set its index to doc_id:

The preliminary token table with raw document content is shown below:

Next, the doc_content column is split into sentences using nltk's sent_tokenize method:

The sentences generated above are then tokenized into words using nltk's WhitespaceTokenizer method and then labeled with part-of-speech using pos_tag from nltk:

The generated part of speech tuples are then split into their respective columns and the tuple column is removed:

The TOKEN table with the part-of-speech (POS) is shown below:

Now that the TOKEN table has been created, an ordered hierarchy of content objects can be defined. For this project, it is useful to define the OHCO of:

  1. Document (doc_id)
  2. Sentence (sent_num)
  3. Token (token_num)

This OHCO reflects the fact that none of the articles contain paragraphs or other structural objects.

A term string is defined from the token string by removing all punctuation from tokens:

All blank or NaN terms are then dropped from the TOKEN table:
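The two steps above (stripping punctuation to form the term string, then dropping empty terms) can be sketched on a toy TOKEN table; the regex used here is an assumption about how "all punctuation" was removed:

```python
import pandas as pd

# Toy TOKEN table (column names follow the text's conventions)
TOKEN = pd.DataFrame({'token_str': ['"Hello,', 'world!', '--', 'votes', '2016.']})

# Strip all non-word characters from the token string to form the term string
TOKEN['term_str'] = (TOKEN['token_str']
                     .str.replace(r'[\W_]+', '', regex=True)
                     .str.lower())

# Drop blank or missing terms
TOKEN = TOKEN.dropna(subset=['term_str'])
TOKEN = TOKEN[TOKEN['term_str'].str.len() > 0]
```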

VOCAB Table

The VOCAB table is created from the TOKEN table by finding and counting occurrences of all unique TOKENS in the TOKEN table.

It is then annotated with a binary indicator of whether each term string is numeric:
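Both steps — counting unique terms into VOCAB and flagging numeric terms — can be sketched as follows on toy data (the column names are assumptions):

```python
import pandas as pd

# Toy term column from a TOKEN table
TOKEN = pd.DataFrame({'term_str': ['vote', 'tax', 'vote', '2016', 'tax', 'vote']})

# Count occurrences of each unique term to form VOCAB
VOCAB = TOKEN['term_str'].value_counts().to_frame('n')
VOCAB.index.name = 'term_str'

# Binary flag for purely numeric terms
VOCAB['num'] = VOCAB.index.str.isnumeric().astype(int)
```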

Annotating TOKEN and VOCAB Tables

Next, an annotation to indicate if each word in the VOCAB table is a stop word is added using nltk's stopwords corpus:

Finally, each item in the VOCAB table is stemmed using the PorterStemmer method from nltk:
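The stemming annotation can be sketched with nltk's PorterStemmer on a toy vocabulary (the column name `stem` is an assumption):

```python
import pandas as pd
from nltk.stem.porter import PorterStemmer

# Toy VOCAB index of term strings
VOCAB = pd.DataFrame(index=pd.Index(['running', 'votes', 'taxes'], name='term_str'))

# Stem each vocabulary item with the Porter stemmer
stemmer = PorterStemmer()
VOCAB['stem'] = [stemmer.stem(t) for t in VOCAB.index]
```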

After the VOCAB and TOKEN tables are completed, information from each of them can be transferred to the other. To map the TOKEN table to the VOCAB table, the unique term_id from the VOCAB table is added to the TOKEN table. Next, the most frequent POS can be added to the VOCAB table by counting the number of times a word is used for each POS and mapping the maximum count value to the VOCAB table:
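The "most frequent POS per term" mapping can be sketched with a groupby count followed by `idxmax`; the toy table and column names here are illustrative:

```python
import pandas as pd

# Toy TOKEN table with term strings and POS tags
TOKEN = pd.DataFrame({
    'term_str': ['vote', 'vote', 'vote', 'tax'],
    'pos':      ['NN',   'VB',   'NN',   'NN'],
})

# Count how often each term appears under each POS, then keep the most frequent tag
pos_counts = TOKEN.groupby(['term_str', 'pos']).size()
pos_max = pos_counts.unstack(fill_value=0).idxmax(axis=1)
```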

Bag of Words and TFIDF

As documents can be represented as a bag of words, each article is converted to a bag of words representation using the article itself as a bag:

Using this bag of words, a document term count matrix is created:

The document term count matrix can be used to generate a TFIDF value for each token. To begin, each term's frequency is calculated:

Next, a document frequency is generated:

To calculate the inverse document frequency, the number of documents is needed. In this case, it is equal to the number of rows in the document term count matrix. As seen in the plots in the exploratory data analysis section, there are 480 articles in the analysis corpus:

The inverse document frequency is then calculated using N:

TF and IDF are then combined to calculate TFIDF:
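Pulling the steps above together, a minimal TFIDF computation over a toy document-term count matrix might look like this; the base-2 logarithm is an assumption about the exact IDF formula used:

```python
import numpy as np
import pandas as pd

# Toy document-term count matrix (rows: documents, columns: terms)
DTCM = pd.DataFrame(
    [[2, 0, 1],
     [0, 3, 1],
     [1, 1, 0]],
    index=['d0', 'd1', 'd2'],
    columns=['vote', 'tax', 'news'],
)

TF = DTCM.div(DTCM.sum(axis=1), axis=0)  # term frequency per document
DF = (DTCM > 0).sum()                    # document frequency per term
N = len(DTCM)                            # number of documents
IDF = np.log2(N / DF)                    # inverse document frequency
TFIDF = TF * IDF                         # combine TF and IDF
```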

Annotating Tables with TFIDF

First, the bag of words (BOW) table is annotated using both TF and TFIDF, the term string and highest occurrence part of speech are also added from the VOCAB table:

The TFIDF_sum of each term in the VOCAB table is then calculated:

Creating the DOC Table

Next, a DOC table can be created using the chosen OHCO. In this case, the OHCO is only the document level, as none of the articles contain paragraphs. To begin creating the doc table, the TFIDF table is grouped by OHCO and then TFIDF is aggregated to its mean for each document:

Next, the doc table is created from the TFIDF doc_id column:

A new index, doc_num, is created based on the row number:

The document title and source are added to DOC from the LIB table:

Clustering Algorithms

To cluster the articles, the L0, L1, and L2 norms are calculated for use in determining distances between documents:

To compare distances between documents, a PAIRS dataframe is created using all unique combinations of documents:

Next, five different types of distance values are calculated:
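Pairwise distances under several metrics can be sketched with scipy's `pdist`; the exact five metrics used in the project are an assumption, so a representative set is shown here on toy document vectors:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy document vectors (e.g. rows of a TFIDF matrix)
X = np.array([[0.2, 0.8, 0.0],
              [0.1, 0.7, 0.2],
              [0.9, 0.0, 0.1]])

# Pairwise distances under several metrics (the project's exact five are assumptions)
metrics = ['cityblock', 'euclidean', 'cosine', 'chebyshev', 'jensenshannon']
distances = {m: pdist(X, metric=m) for m in metrics}

# Condensed vectors can be expanded to square matrices for inspection
D_cosine = squareform(distances['cosine'])
```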

Hierarchical Clustering

The distances between documents can be visualized using hierarchical clustering. First, the necessary libraries for creating the dendrograms are imported:

Next, a function for plotting the hierarchical clustering dendrograms is defined. This function truncates the dendrograms to 5 levels due to the high number of leaves in the tree.

The first dendrogram uses the cosine distance between each pair of documents as its measure. Notably, the color threshold identifies three distinct clusters, matching the three distinct sources in the corpus.

To explore if the algorithm discriminated between the three sources in the corpus, the labels are changed to the document source:

It appears that the algorithm has partially discriminated between the three sources; however, the majority of leaves reside in the top portion of the tree and may not be as easily separable.

Another distance measure, Jensen-Shannon divergence, also appears effective at separating the news sources and at creating more substantive groups of articles.

K-Means Clustering

Another type of clustering, K-Means clustering, can also be used to find groupings within the articles. K-Means clustering is accomplished using the KMeans method from Sci-Kit Learn.

The key parameter for K-Means is k, the number of clusters to create. As there are three sources in our corpus, this parameter is set to k = 3:

Next, the algorithm is run on each of the calculated norms and the raw TFIDF:
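Running K-Means on one of the matrices can be sketched as follows; the data is a toy stand-in for a TFIDF or norm matrix, and `random_state` is an assumption for reproducibility:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy TFIDF-like matrix: 9 documents, 4 features
rng = np.random.default_rng(0)
X = rng.random((9, 4))

# k = 3 clusters, one per source in the corpus
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
```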

The results of this clustering can be visualized using heatmaps, as shown below:

It appears that the algorithm does not discriminate between sources very well. Perhaps this is due to the fact that all three sources are reporting on the same events, so the content of the articles is similar enough to affect the clustering.

Principal Component Analysis

The next analysis technique to apply is Principal Component Analysis (PCA), a form of dimensionality reduction. This analysis is performed both with the PCA method from Sci-Kit Learn and by computing the eigenvalues and eigenvectors directly.

The TFIDF table is then normalized:

Using the normalized TFIDF table, a covariance matrix is constructed:

Next, the eigenvalues and eigenvectors are calculated using the eigh method from scipy.

The eigenvalues and eigenvectors are then converted into pandas DataFrames:

Each eigenvalue is then combined with its corresponding eigenvector and the explained variance of each term is calculated:
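The eigendecomposition steps above can be sketched on a toy centered matrix; note that scipy's `eigh` returns eigenvalues in ascending order, so they are reversed here to rank components by explained variance:

```python
import numpy as np
from scipy.linalg import eigh

# Toy normalized TFIDF matrix (documents x terms), centered by column
rng = np.random.default_rng(1)
X = rng.random((20, 5))
X = X - X.mean(axis=0)

# Covariance matrix of the terms
COV = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order; reverse for descending
eig_vals, eig_vecs = eigh(COV)
eig_vals, eig_vecs = eig_vals[::-1], eig_vecs[:, ::-1]

# Explained variance ratio of each component
explained = eig_vals / eig_vals.sum()
```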

Next the principal components are added to a dataframe:

Then the loadings for each term are stored in a loadings dataframe:

Some of the terms associated with the positive and negative directions of the first and second principal components are shown below:

The first principal component, PC0, seems to be associated with the presidential election in the positive direction, and other topics in the negative direction. The second principal component, PC1, seems to be associated with the investigation into Russian interference in the 2016 election in the positive direction, and the GOP efforts to pass a tax reform and Obamacare repeal legislation in the negative direction.

The principal components also allow the data to be visualized along these axes. To do so, a function is defined to plot the articles by principal component:

To visualize the documents, they must be projected onto a new subspace.

The first and second principal components are visualized below:

Interestingly, the three sources are not very well separated using the first two principal components. The left and right-leaning sources, PowerLine and Daily Kos, have less separation in these directions than the Guardian does from either of these sources. However, these differences are very small, as the first and second principal components only account for a small proportion of the explained variance.

In addition to the method shown above, the principal components can also be acquired from SciKit Learn's PCA method:

The SciKit Learn method produces the same results as the manual method, but much more quickly.

Topic Modeling

The next area to explore in the corpus is topic modeling. This is accomplished using the Latent Dirichlet Allocation algorithm, as implemented in the LDA function from SciKit Learn.

To preprocess the data, SciKit Learn's CountVectorizer is used to convert the raw document content into a sparse vector. It takes the argument n_terms, or the number of top terms from each document to add to the vector.

The LatentDirichletAllocation method takes the following arguments:

Next, each article's nouns are stored in a dataframe containing the doc_id and the nouns from doc_content:

It is then passed to a count vectorizer to vectorize each row:

The LDA object is then initialized using the parameters specified above:

The LDA object can then be fit to the count vectorizer output to generate THETA or the Document-Topic matrix:

The LDA object can also be used to generate the PHI or Topic-Word matrix:

PHI can then be used to generate a matrix containing the top 10 words by topic:

The topics generated by Latent Dirichlet Allocation are very interesting; it appears that the corpus contains several distinct topics. Examples include:

The distribution of topics can also be visualized across the corpus. To do so, the sum of each topic can be taken across all documents from the THETA matrix. A visualization of the distribution of topics is shown below:

As expected, the topics pertaining to the 2016 United States Presidential election dominate the corpus, due to their massive importance to the American political news media during the time period examined. Examining which topics were covered in each news source is also of interest, and can be accomplished by adding the source information from the LIB table to a table containing the topics and distribution by source.

The first source, PowerLine, focuses much more on the election as it pertains to taxes than the other two sources. This may be due to its right-leaning political alignment, and the importance of those issues to the American right-wing parties during that time.

The second source, the Guardian, seems to focus more on the election as it pertains to business and news due to the presence of topic 13 as its most prevalent topic.

The third source, Daily Kos, has topic 19 as its second-most associated topic. This topic contains women and health, indicating that it may refer to articles about women's access to healthcare, which was an issue during the 2016 Presidential Election and its aftermath.

As with documents, topics can also be clustered and visualized using dendrograms. To do so, methods from Scipy and SciKit Learn are imported:

Next, a method for plotting the dendrograms is defined:

The similarities between topics are found by measuring the distances between topics in the PHI matrix. For this corpus, Euclidean distance was used to define a TREE dataframe for plotting.

Labels are then generated from the SOURCES dataframe:

The dendrogram is then plotted:

The dendrogram shown above indicates that many of the topics are similar to one another. Interestingly, the distance between topics 19 and 6 is low, and both topics seem to relate to planks of the Democratic Party's 2016 campaign platform, namely climate change and women's rights.

Word Embeddings with Word2Vec

Next, Word Embeddings are generated using gensim's word2vec. They are then visualized using TSNE plots.

Again, doc_id, or the individual article, is used as the bag. The window size is set to 10 for maximum context around each word.

The corpus for word2vec is generated by grouping the tokens into one list per bag.

The model is then trained on the corpus using the following parameters:

The coordinates for each term are then stored in a dataframe:

The model is then visualized using t-SNE to reduce the embeddings to 2 dimensions.

There are many interesting groupings in the t-SNE plot above including:

Sentiment Analysis

The final analysis to be executed on this corpus is sentiment analysis. For this analysis the salex_nrc lexicon is used. It is loaded from a .csv file:

Next, the values imported from the lexicon are joined to the TOKEN table:

Individual dataframes are then created for each document source:

The mean values of each emotion are then plotted for each source:
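The per-source emotion means reduce to a simple groupby over a TOKEN table already joined to the lexicon; the emotion columns and values below are illustrative stand-ins:

```python
import pandas as pd

# Toy TOKEN table already joined to the sentiment lexicon (values are illustrative)
TOKEN = pd.DataFrame({
    'source': ['PowerLine', 'PowerLine', 'Guardian', 'Guardian'],
    'trust':  [1, 0, 1, 1],
    'anger':  [1, 1, 0, 0],
})

# Mean value of each emotion per source
emotion_means = TOKEN.groupby('source')[['trust', 'anger']].mean()
```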

Interestingly, all of the document sources share the same top emotion: trust. However, PowerLine differs from the Guardian and Daily Kos in that its second and third-most prevalent emotions are fear and anger, while the other two sources have anticipation and fear as their second and third-most prevalent emotions.

Another interesting area that can be explored through sentiment analysis is the change in document sentiment over time. To visualize these trends, each source dataframe is grouped by year and month and a 3-month rolling mean of sentiment is generated. The resulting plots are shown below:

As seen in the bar graphs above, all three sources have trust as the most prevalent emotion during the analyzed timeframe. However, trust seems to decline over time in both Guardian and Daily Kos articles, while PowerLine remains constant. PowerLine also appears to have slightly lower polarity than the other two sources.

Finally, sentiment can also be examined using the Valence Aware Dictionary and sEntiment Reasoner, or VADER. To do so, the necessary packages are imported and a SentimentIntensityAnalyzer object is initialized.

Next, the analyzer is used to calculate positive, negative, neutral, and compound sentiment scores for each sentence in the corpus.

The dataframes are then sorted by doc_id, since doc_id increases with doc_date.

A method to produce positive/negative, neutral, and compound sentiment plots for each source is created:

The plots are then visualized using a dictionary to map the name of the source to its dataframe:

The plots above, while not indexed by document date, can be seen as a visualization of sentiment over the progression through each corpus due to the fact that doc_id increases with time. Many of the trends in these plots are interesting, including the increase in positive sentiment over time in PowerLine articles coupled with the increase in negative sentiment in Guardian and Daily Kos articles around the same time period.

To conclude sentiment analysis, the three dataframes are concatenated together to form one sentiment dataframe.

Data File Export

Core Data Tables

Annotations to Core Data Tables

Principal Components

Topic Models

Word Embeddings

Sentiment Analysis